Sroka — a Python library to simplify data access
Article is available at medium under this link. Some say that in the data world, getting and preparing the data constitutes 80% of the whole process, leaving the rest to actual analysis and visualisation. That’s true in many cases, but one thing is certain, combining a few data sources together can be a tedious task. We, at Fandom, saw this as an area for improvement and built a Python library — Sroka (https://github.com/Wikia/sroka) to ease our lives. Sroka is a library that with simple Python functions lets us directly query or download data from various sources (through APIs), in an analysis-ready format. What is Sroka? Sroka is a bird species in Polish (eng. magpie). It is known for loving and collecting all shiny things. By analogy, Sroka the library, allows you to access all the important data and gather them in one place easily. Sroka provides a set of custom functions for each data source. Their descriptions are gathered in respective README.md files. It is a product of teamwork, you can find all contributors on github. The library currently supports following data sources: * AWS (Athena and S3) * Google Ad Manager * Google Analytics * Google Sheets * Moat * Qubole * Rubicon Why we built it? Working in the Ad Engineering team, we have access to many kinds of data — information that we gather in our own data warehouse, ads data from Google Ad Manager, webpage data from Google Analytics, and if that’s not enough — we also have data and dashboards provided by various third-party partners. Apart from differences in how data is aggregated and what information it provides, each of the APIs that we can use to gather data has its own way to query results. Each of them requires you to provide a separate set of credentials too. Sroka keeps them all in a sroka_config.ini file in your home directory and automatically passes them with your function calls. Who can use it? Anyone that uses data and has access to the above mentioned data sources. We have built it within an analytics “fraction” of the engineering team as it is analysts who used to query all of the sources on a daily basis. However, the ultimate goal is to make data querying easier and more accessible to a wider audience. Taking the data out of clickable UI’s into code based analysis encourages engineers to take part in the analysis process too. Example Before you use sroka, you have to install the library. You can do it using pip: pip install sroka Most of our analyses are conducted within Jupyter Notebooks. It is common for us to combine two (or more) data sources. Sometimes the goal is to combine different information from two sources, other times — we want to make sure that the data matches. The latter allows us to cross check e.g. whether our custom events are sent or whether Google Ad Manager is configured correctly. Sroka makes it very easy. We start off with all the right imports, including Sroka functions. Here we will use data stored in Athena and Google Ad Manager. # Athena API from sroka.api.athena.athena_api import query_athena, done_athena# GAM API from sroka.api.google_ad_manager.gam_api import get_data_from_admanager# data wrangling import pandas as pd import numpy as np import matplotlib.pyplot as plt import seaborn as sns import matplotlib.dates as mdates# seaborn default plot design sns.set() We then query the sources directly from notebook. Simplified data acquisition and wrangling workflow would look like this: # GAM query start_day = '01' end_day = '31' start_month = '03' end_month = '03' year = '2019'query = """WHERE CUSTOM_TARGETING_VALUE_ID IN (wiki_ids)""" dimensions = 'DATE' columns = 'TOTAL_LINE_ITEM_LEVEL_IMPRESSIONS' start_date = {'year': year, 'month': start_month, 'day': start_day} stop_date = {'year': year, 'month': end_month, 'day': end_day}print('starting...') data_raw_gam = get_data_from_admanager(query, dimensions, columns, start_date, stop_date) print('data gathered')# create df copy df_gam = data_raw_gam.copy()# change column names df_gam.rename( columns={ 'Dimension.DATE': 'Date', 'Column.TOTAL_LINE_ITEM_LEVEL_IMPRESSIONS': 'Impressions' }, inplace=True)# change column format to datetime, it is needed to plot df_gam'Date' = pd.to_datetime(df_gam'Date')# set index df_gam.set_index('Date', inplace=True)# check first rows df_gam.head() Querying and preparing data from Athena: # download Athena data df_athena = query_athena( ' SELECT concat(year, '-', month, '-', day) AS day, count(ad_impressions) FROM fandom_ads_data_table WHERE year = '2019' AND month = '03' AND wiki_id IN (wiki_ids) GROUP BY CONCAT(year, '-', month, '-', day) ORDER BY day ASC ')# change column names df_athena.rename( columns={ 'day': 'Date', 'countad_impressions': 'Impressions' }, inplace=True)# change column format to datetime, it is needed to plot df_athena'Date' = pd.to_datetime(df_athena'Date')# set index df_athena.set_index('Date', inplace=True) df_athena.head() Having both data sets prepared, we can visualise and compare the results. # initiate plot fig, ax = plt.subplots(figsize=(15,6)) ax.plot(df_athena'Impressions', label='Athena') ax.plot(df_gam'Impressions', label='Google Ad Manager')# plot elements, title, labels etc. plt.ylabel('# impressions', fontsize=13) plt.title('wiki_id all ad impressions by source', fontsize=16, pad=20) plt.xticks(rotation=30) plt.legend(bbox_to_anchor=(1.25, 0.5), frameon=False, fontsize=14) plt.ylim(0) plt.yticks([]) ax.xaxis.set_major_locator(mdates.DayLocator(interval=3)) ax.xaxis.set_major_formatter(mdates.DateFormatter('%Y/%m/%d')); With these few code cells we are able to get a lot of insight, with no need to jump between data sources and manually downloading reports. We can immediately see that both data sources follow the same pattern, it shows seasonality, but also that as the impressions increase, so do the discrepancies between data sources. Depending on the situation, this is maybe a known issue, or a signal to verify the reporting. What’s next? We hope that Sroka will help to simplify the process of data acquisition in places outside of Fandom. If you have any suggestions on how to improve or extend Sroka’s capabilities, feel free to open an issue on GitHub (https://github.com/Wikia/sroka). Also if you feel that additional data sources should be included everyone is welcome to contribute and create a PR. Category:Technical articles